Module 04: Lab 01

Visual Reporting and Storytelling

visualization
plotly
spark
Visual Reporting
Storytelling with Data
Industry-Specific Visualization
Author

Binderiya Dugersuren

Published

March 23, 2025

Modified

March 25, 2025

Objectives

By the end of this lab, you will: 1. Load and analyze the Lightcast dataset in Spark DataFrame. 2. Create five easy and three medium-complexity visualizations using Plotly. 3. Explore salary distributions, employment trends, and job postings. 4. Analyze skills in relation to NAICS/SOC/ONET codes and salaries. 5. Customize colors, fonts, and styles in all visualizations (default themes result in a 2.5-point deduction). 6. Follow best practices for reporting on data communication.

Step 1: Load the Dataset

import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "vscode"
from pyspark.sql import SparkSession
from pyspark.sql.functions import col


# Initialize Spark Session
spark = SparkSession.builder.appName("LightcastData").getOrCreate()

# Load Data
df = spark.read.option("header", "true").option("inferSchema", "true").option("multiLine","true").option("escape", "\"").csv("lightcast_job_postings.csv")

# Show Schema and Sample Data
df.printSchema()
df.show(5)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/25 02:58:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                25/03/25 02:59:15 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                
root
 |-- ID: string (nullable = true)
 |-- LAST_UPDATED_DATE: string (nullable = true)
 |-- LAST_UPDATED_TIMESTAMP: timestamp (nullable = true)
 |-- DUPLICATES: integer (nullable = true)
 |-- POSTED: string (nullable = true)
 |-- EXPIRED: string (nullable = true)
 |-- DURATION: integer (nullable = true)
 |-- SOURCE_TYPES: string (nullable = true)
 |-- SOURCES: string (nullable = true)
 |-- URL: string (nullable = true)
 |-- ACTIVE_URLS: string (nullable = true)
 |-- ACTIVE_SOURCES_INFO: string (nullable = true)
 |-- TITLE_RAW: string (nullable = true)
 |-- BODY: string (nullable = true)
 |-- MODELED_EXPIRED: string (nullable = true)
 |-- MODELED_DURATION: integer (nullable = true)
 |-- COMPANY: integer (nullable = true)
 |-- COMPANY_NAME: string (nullable = true)
 |-- COMPANY_RAW: string (nullable = true)
 |-- COMPANY_IS_STAFFING: boolean (nullable = true)
 |-- EDUCATION_LEVELS: string (nullable = true)
 |-- EDUCATION_LEVELS_NAME: string (nullable = true)
 |-- MIN_EDULEVELS: integer (nullable = true)
 |-- MIN_EDULEVELS_NAME: string (nullable = true)
 |-- MAX_EDULEVELS: integer (nullable = true)
 |-- MAX_EDULEVELS_NAME: string (nullable = true)
 |-- EMPLOYMENT_TYPE: integer (nullable = true)
 |-- EMPLOYMENT_TYPE_NAME: string (nullable = true)
 |-- MIN_YEARS_EXPERIENCE: integer (nullable = true)
 |-- MAX_YEARS_EXPERIENCE: integer (nullable = true)
 |-- IS_INTERNSHIP: boolean (nullable = true)
 |-- SALARY: integer (nullable = true)
 |-- REMOTE_TYPE: integer (nullable = true)
 |-- REMOTE_TYPE_NAME: string (nullable = true)
 |-- ORIGINAL_PAY_PERIOD: string (nullable = true)
 |-- SALARY_TO: integer (nullable = true)
 |-- SALARY_FROM: integer (nullable = true)
 |-- LOCATION: string (nullable = true)
 |-- CITY: string (nullable = true)
 |-- CITY_NAME: string (nullable = true)
 |-- COUNTY: integer (nullable = true)
 |-- COUNTY_NAME: string (nullable = true)
 |-- MSA: integer (nullable = true)
 |-- MSA_NAME: string (nullable = true)
 |-- STATE: integer (nullable = true)
 |-- STATE_NAME: string (nullable = true)
 |-- COUNTY_OUTGOING: integer (nullable = true)
 |-- COUNTY_NAME_OUTGOING: string (nullable = true)
 |-- COUNTY_INCOMING: integer (nullable = true)
 |-- COUNTY_NAME_INCOMING: string (nullable = true)
 |-- MSA_OUTGOING: integer (nullable = true)
 |-- MSA_NAME_OUTGOING: string (nullable = true)
 |-- MSA_INCOMING: integer (nullable = true)
 |-- MSA_NAME_INCOMING: string (nullable = true)
 |-- NAICS2: integer (nullable = true)
 |-- NAICS2_NAME: string (nullable = true)
 |-- NAICS3: integer (nullable = true)
 |-- NAICS3_NAME: string (nullable = true)
 |-- NAICS4: integer (nullable = true)
 |-- NAICS4_NAME: string (nullable = true)
 |-- NAICS5: integer (nullable = true)
 |-- NAICS5_NAME: string (nullable = true)
 |-- NAICS6: integer (nullable = true)
 |-- NAICS6_NAME: string (nullable = true)
 |-- TITLE: string (nullable = true)
 |-- TITLE_NAME: string (nullable = true)
 |-- TITLE_CLEAN: string (nullable = true)
 |-- SKILLS: string (nullable = true)
 |-- SKILLS_NAME: string (nullable = true)
 |-- SPECIALIZED_SKILLS: string (nullable = true)
 |-- SPECIALIZED_SKILLS_NAME: string (nullable = true)
 |-- CERTIFICATIONS: string (nullable = true)
 |-- CERTIFICATIONS_NAME: string (nullable = true)
 |-- COMMON_SKILLS: string (nullable = true)
 |-- COMMON_SKILLS_NAME: string (nullable = true)
 |-- SOFTWARE_SKILLS: string (nullable = true)
 |-- SOFTWARE_SKILLS_NAME: string (nullable = true)
 |-- ONET: string (nullable = true)
 |-- ONET_NAME: string (nullable = true)
 |-- ONET_2019: string (nullable = true)
 |-- ONET_2019_NAME: string (nullable = true)
 |-- CIP6: string (nullable = true)
 |-- CIP6_NAME: string (nullable = true)
 |-- CIP4: string (nullable = true)
 |-- CIP4_NAME: string (nullable = true)
 |-- CIP2: string (nullable = true)
 |-- CIP2_NAME: string (nullable = true)
 |-- SOC_2021_2: string (nullable = true)
 |-- SOC_2021_2_NAME: string (nullable = true)
 |-- SOC_2021_3: string (nullable = true)
 |-- SOC_2021_3_NAME: string (nullable = true)
 |-- SOC_2021_4: string (nullable = true)
 |-- SOC_2021_4_NAME: string (nullable = true)
 |-- SOC_2021_5: string (nullable = true)
 |-- SOC_2021_5_NAME: string (nullable = true)
 |-- LOT_CAREER_AREA: integer (nullable = true)
 |-- LOT_CAREER_AREA_NAME: string (nullable = true)
 |-- LOT_OCCUPATION: integer (nullable = true)
 |-- LOT_OCCUPATION_NAME: string (nullable = true)
 |-- LOT_SPECIALIZED_OCCUPATION: integer (nullable = true)
 |-- LOT_SPECIALIZED_OCCUPATION_NAME: string (nullable = true)
 |-- LOT_OCCUPATION_GROUP: integer (nullable = true)
 |-- LOT_OCCUPATION_GROUP_NAME: string (nullable = true)
 |-- LOT_V6_SPECIALIZED_OCCUPATION: integer (nullable = true)
 |-- LOT_V6_SPECIALIZED_OCCUPATION_NAME: string (nullable = true)
 |-- LOT_V6_OCCUPATION: integer (nullable = true)
 |-- LOT_V6_OCCUPATION_NAME: string (nullable = true)
 |-- LOT_V6_OCCUPATION_GROUP: integer (nullable = true)
 |-- LOT_V6_OCCUPATION_GROUP_NAME: string (nullable = true)
 |-- LOT_V6_CAREER_AREA: integer (nullable = true)
 |-- LOT_V6_CAREER_AREA_NAME: string (nullable = true)
 |-- SOC_2: string (nullable = true)
 |-- SOC_2_NAME: string (nullable = true)
 |-- SOC_3: string (nullable = true)
 |-- SOC_3_NAME: string (nullable = true)
 |-- SOC_4: string (nullable = true)
 |-- SOC_4_NAME: string (nullable = true)
 |-- SOC_5: string (nullable = true)
 |-- SOC_5_NAME: string (nullable = true)
 |-- LIGHTCAST_SECTORS: string (nullable = true)
 |-- LIGHTCAST_SECTORS_NAME: string (nullable = true)
 |-- NAICS_2022_2: integer (nullable = true)
 |-- NAICS_2022_2_NAME: string (nullable = true)
 |-- NAICS_2022_3: integer (nullable = true)
 |-- NAICS_2022_3_NAME: string (nullable = true)
 |-- NAICS_2022_4: integer (nullable = true)
 |-- NAICS_2022_4_NAME: string (nullable = true)
 |-- NAICS_2022_5: integer (nullable = true)
 |-- NAICS_2022_5_NAME: string (nullable = true)
 |-- NAICS_2022_6: integer (nullable = true)
 |-- NAICS_2022_6_NAME: string (nullable = true)

+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+--------------------+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
|                  ID|LAST_UPDATED_DATE|LAST_UPDATED_TIMESTAMP|DUPLICATES|  POSTED|  EXPIRED|DURATION|        SOURCE_TYPES|             SOURCES|                 URL|ACTIVE_URLS|ACTIVE_SOURCES_INFO|           TITLE_RAW|                BODY|MODELED_EXPIRED|MODELED_DURATION| COMPANY|        COMPANY_NAME|COMPANY_RAW|COMPANY_IS_STAFFING|EDUCATION_LEVELS|EDUCATION_LEVELS_NAME|MIN_EDULEVELS| MIN_EDULEVELS_NAME|MAX_EDULEVELS|MAX_EDULEVELS_NAME|EMPLOYMENT_TYPE|EMPLOYMENT_TYPE_NAME|MIN_YEARS_EXPERIENCE|MAX_YEARS_EXPERIENCE|IS_INTERNSHIP|SALARY|REMOTE_TYPE|REMOTE_TYPE_NAME|ORIGINAL_PAY_PERIOD|SALARY_TO|SALARY_FROM|            LOCATION|                CITY|    CITY_NAME|COUNTY|   COUNTY_NAME|  MSA|            MSA_NAME|STATE|STATE_NAME|COUNTY_OUTGOING|COUNTY_NAME_OUTGOING|COUNTY_INCOMING|COUNTY_NAME_INCOMING|MSA_OUTGOING|   MSA_NAME_OUTGOING|MSA_INCOMING|   MSA_NAME_INCOMING|NAICS2|         NAICS2_NAME|NAICS3|         NAICS3_NAME|NAICS4|         NAICS4_NAME|NAICS5|         NAICS5_NAME|NAICS6|         NAICS6_NAME|             TITLE|         TITLE_NAME|         TITLE_CLEAN|              SKILLS|         SKILLS_NAME|  SPECIALIZED_SKILLS|SPECIALIZED_SKILLS_NAME|      CERTIFICATIONS| CERTIFICATIONS_NAME|       COMMON_SKILLS|  COMMON_SKILLS_NAME|     SOFTWARE_SKILLS|SOFTWARE_SKILLS_NAME|      ONET|           ONET_NAME| ONET_2019|      ONET_2019_NAME|                CIP6|           CIP6_NAME|                CIP4|           CIP4_NAME|                CIP2|           CIP2_NAME|SOC_2021_2|     SOC_2021_2_NAME|SOC_2021_3|     SOC_2021_3_NAME|SOC_2021_4|SOC_2021_4_NAME|SOC_2021_5|SOC_2021_5_NAME|LOT_CAREER_AREA|LOT_CAREER_AREA_NAME|LOT_OCCUPATION| LOT_OCCUPATION_NAME|LOT_SPECIALIZED_OCCUPATION|LOT_SPECIALIZED_OCCUPATION_NAME|LOT_OCCUPATION_GROUP|LOT_OCCUPATION_GROUP_NAME|LOT_V6_SPECIALIZED_OCCUPATION|LOT_V6_SPECIALIZED_OCCUPATION_NAME|LOT_V6_OCCUPATION|LOT_V6_OCCUPATION_NAME|LOT_V6_OCCUPATION_GROUP|LOT_V6_OCCUPATION_GROUP_NAME|LOT_V6_CAREER_AREA|LOT_V6_CAREER_AREA_NAME|  SOC_2|          SOC_2_NAME|  SOC_3|          SOC_3_NAME|  SOC_4|     SOC_4_NAME|  SOC_5|     SOC_5_NAME|LIGHTCAST_SECTORS|LIGHTCAST_SECTORS_NAME|NAICS_2022_2|   NAICS_2022_2_NAME|NAICS_2022_3|   NAICS_2022_3_NAME|NAICS_2022_4|   NAICS_2022_4_NAME|NAICS_2022_5|   NAICS_2022_5_NAME|NAICS_2022_6|   NAICS_2022_6_NAME|
+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+--------------------+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
|1f57d95acf4dc67ed...|         9/6/2024|  2024-09-06 20:32:...|         0|6/2/2024| 6/8/2024|       6|   [\n  "Company"\n]|[\n  "brassring.c...|[\n  "https://sjo...|         []|               NULL|Enterprise Analys...|31-May-2024\n\nEn...|       6/8/2024|               6|  894731|          Murphy USA| Murphy USA|              false|       [\n  2\n]| [\n  "Bachelor's ...|            2|  Bachelor's degree|         NULL|              NULL|              1|Full-time (> 32 h...|                   2|                   2|        false|  NULL|          0|          [None]|               NULL|     NULL|       NULL|{\n  "lat": 33.20...|RWwgRG9yYWRvLCBBUg==|El Dorado, AR|  5139|     Union, AR|20980|       El Dorado, AR|    5|  Arkansas|           5139|           Union, AR|           5139|           Union, AR|       20980|       El Dorado, AR|       20980|       El Dorado, AR|    44|        Retail Trade|   441|Motor Vehicle and...|  4413|Automotive Parts,...| 44133|Automotive Parts ...|441330|Automotive Parts ...|ET29C073C03D1F86B4|Enterprise Analysts|enterprise analys...|[\n  "KS126DB6T06...|[\n  "Merchandisi...|[\n  "KS126DB6T06...|   [\n  "Merchandisi...|                  []|                  []|[\n  "KS126706DPF...|[\n  "Mathematics...|[\n  "KS440W865GC...|[\n  "SQL (Progra...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|[\n  "45.0601",\n...|[\n  "Economics, ...|[\n  "45.06",\n  ...|[\n  "Economics",...|[\n  "45",\n  "27...|[\n  "Social Scie...|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231010|Business Intellig...|                  23101011|           General ERP Analy...|                2310|     Business Intellig...|                     23101011|              General ERP Analy...|           231010|  Business Intellig...|                   2310|        Business Intellig...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|        [\n  7\n]|  [\n  "Artificial ...|          44|        Retail Trade|         441|Motor Vehicle and...|        4413|Automotive Parts,...|       44133|Automotive Parts ...|      441330|Automotive Parts ...|
|0cb072af26757b6c4...|         8/2/2024|  2024-08-02 17:08:...|         0|6/2/2024| 8/1/2024|    NULL| [\n  "Job Board"\n]| [\n  "maine.gov"\n]|[\n  "https://job...|         []|               NULL|Oracle Consultant...|Oracle Consultant...|       8/1/2024|            NULL|  133098|Smx Corporation L...|        SMX|               true|      [\n  99\n]| [\n  "No Educatio...|           99|No Education Listed|         NULL|              NULL|              1|Full-time (> 32 h...|                   3|                   3|        false|  NULL|          1|          Remote|               NULL|     NULL|       NULL|{\n  "lat": 44.31...|    QXVndXN0YSwgTUU=|  Augusta, ME| 23011|  Kennebec, ME|12300|Augusta-Watervill...|   23|     Maine|          23011|        Kennebec, ME|          23011|        Kennebec, ME|       12300|Augusta-Watervill...|       12300|Augusta-Watervill...|    56|Administrative an...|   561|Administrative an...|  5613| Employment Services| 56132|Temporary Help Se...|561320|Temporary Help Se...|ET21DDA63780A7DC09| Oracle Consultants|oracle consultant...|[\n  "KS122626T55...|[\n  "Procurement...|[\n  "KS122626T55...|   [\n  "Procurement...|                  []|                  []|                  []|                  []|[\n  "BGSBF3F508F...|[\n  "Oracle Busi...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|                  []|                  []|                  []|                  []|                  []|                  []|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231010|Business Intellig...|                  23101012|           Oracle Consultant...|                2310|     Business Intellig...|                     23101012|              Oracle Consultant...|           231010|  Business Intellig...|                   2310|        Business Intellig...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|             NULL|                  NULL|          56|Administrative an...|         561|Administrative an...|        5613| Employment Services|       56132|Temporary Help Se...|      561320|Temporary Help Se...|
|85318b12b3331fa49...|         9/6/2024|  2024-09-06 20:32:...|         1|6/2/2024| 7/7/2024|      35| [\n  "Job Board"\n]|[\n  "dejobs.org"\n]|[\n  "https://dej...|         []|               NULL|        Data Analyst|Taking care of pe...|      6/10/2024|               8|39063746|            Sedgwick|   Sedgwick|              false|       [\n  2\n]| [\n  "Bachelor's ...|            2|  Bachelor's degree|         NULL|              NULL|              1|Full-time (> 32 h...|                   5|                NULL|        false|  NULL|          0|          [None]|               NULL|     NULL|       NULL|{\n  "lat": 32.77...|    RGFsbGFzLCBUWA==|   Dallas, TX| 48113|    Dallas, TX|19100|Dallas-Fort Worth...|   48|     Texas|          48113|          Dallas, TX|          48113|          Dallas, TX|       19100|Dallas-Fort Worth...|       19100|Dallas-Fort Worth...|    52|Finance and Insur...|   524|Insurance Carrier...|  5242|Agencies, Brokera...| 52429|Other Insurance R...|524291|    Claims Adjusting|ET3037E0C947A02404|      Data Analysts|        data analyst|[\n  "KS1218W78FG...|[\n  "Management"...|[\n  "ESF3939CE1F...|   [\n  "Exception R...|[\n  "KS683TN76T7...|[\n  "Security Cl...|[\n  "KS1218W78FG...|[\n  "Management"...|[\n  "KS126HY6YLT...|[\n  "Microsoft O...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|                  []|                  []|                  []|                  []|                  []|                  []|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231113|Data / Data Minin...|                  23111310|                   Data Analyst|                2311|     Data Analysis and...|                     23111310|                      Data Analyst|           231113|  Data / Data Minin...|                   2311|        Data Analysis and...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|             NULL|                  NULL|          52|Finance and Insur...|         524|Insurance Carrier...|        5242|Agencies, Brokera...|       52429|Other Insurance R...|      524291|    Claims Adjusting|
|1b5c3941e54a1889e...|         9/6/2024|  2024-09-06 20:32:...|         1|6/2/2024|7/20/2024|      48| [\n  "Job Board"\n]|[\n  "disabledper...|[\n  "https://www...|         []|               NULL|Sr. Lead Data Mgm...|About this role:\...|      6/12/2024|              10|37615159|         Wells Fargo|Wells Fargo|              false|      [\n  99\n]| [\n  "No Educatio...|           99|No Education Listed|         NULL|              NULL|              1|Full-time (> 32 h...|                   3|                NULL|        false|  NULL|          0|          [None]|               NULL|     NULL|       NULL|{\n  "lat": 33.44...|    UGhvZW5peCwgQVo=|  Phoenix, AZ|  4013|  Maricopa, AZ|38060|Phoenix-Mesa-Chan...|    4|   Arizona|           4013|        Maricopa, AZ|           4013|        Maricopa, AZ|       38060|Phoenix-Mesa-Chan...|       38060|Phoenix-Mesa-Chan...|    52|Finance and Insur...|   522|Credit Intermedia...|  5221|Depository Credit...| 52211|  Commercial Banking|522110|  Commercial Banking|ET2114E0404BA30075|Management Analysts|sr lead data mgmt...|[\n  "KS123QX62QY...|[\n  "Exit Strate...|[\n  "KS123QX62QY...|   [\n  "Exit Strate...|                  []|                  []|[\n  "KS7G6NP6R6L...|[\n  "Reliability...|[\n  "KS4409D76NW...|[\n  "SAS (Softwa...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|                  []|                  []|                  []|                  []|                  []|                  []|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231113|Data / Data Minin...|                  23111310|                   Data Analyst|                2311|     Data Analysis and...|                     23111310|                      Data Analyst|           231113|  Data / Data Minin...|                   2311|        Data Analysis and...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|        [\n  6\n]|  [\n  "Data Privac...|          52|Finance and Insur...|         522|Credit Intermedia...|        5221|Depository Credit...|       52211|  Commercial Banking|      522110|  Commercial Banking|
|cb5ca25f02bdf25c1...|        6/19/2024|   2024-06-19 07:00:00|         0|6/2/2024|6/17/2024|      15|[\n  "FreeJobBoar...|[\n  "craigslist....|[\n  "https://mod...|         []|               NULL|Comisiones de $10...|Comisiones de $10...|      6/17/2024|              15|       0|        Unclassified|      LH/GM|              false|      [\n  99\n]| [\n  "No Educatio...|           99|No Education Listed|         NULL|              NULL|              3|Part-time / full-...|                NULL|                NULL|        false| 92500|          0|          [None]|               year|   150000|      35000|{\n  "lat": 37.63...|    TW9kZXN0bywgQ0E=|  Modesto, CA|  6099|Stanislaus, CA|33700|         Modesto, CA|    6|California|           6099|      Stanislaus, CA|           6099|      Stanislaus, CA|       33700|         Modesto, CA|       33700|         Modesto, CA|    99|Unclassified Indu...|   999|Unclassified Indu...|  9999|Unclassified Indu...| 99999|Unclassified Indu...|999999|Unclassified Indu...|ET0000000000000000|       Unclassified|comisiones de por...|                  []|                  []|                  []|                     []|                  []|                  []|                  []|                  []|                  []|                  []|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|                  []|                  []|                  []|                  []|                  []|                  []|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231010|Business Intellig...|                  23101012|           Oracle Consultant...|                2310|     Business Intellig...|                     23101012|              Oracle Consultant...|           231010|  Business Intellig...|                   2310|        Business Intellig...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|             NULL|                  NULL|          99|Unclassified Indu...|         999|Unclassified Indu...|        9999|Unclassified Indu...|       99999|Unclassified Indu...|      999999|Unclassified Indu...|
+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+--------------------+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
only showing top 5 rows

1 Salary Distribution by Employment Type

  • Identify salary trends across different employment types.
  • Filter the dataset
    • Remove records where salary is missing or zero.
  • Aggregate Data
    • Group by employment type and compute salary distribution.
  • Visualize results
    • Create a box plot where:
      • X-axis = EMPLOYMENT_TYPE_NAME
      • Y-axis = SALARY_FROM
    • Customize colors, fonts, and styles to avoid a 2.5-point deduction.
  • Explanation: Write two sentences about what the graph reveals.
from pyspark.sql.functions import avg, min, max, count
filtered_df = df.filter((col("SALARY").isNotNull()) & (col("SALARY") != 0))

aggregated_df = filtered_df.groupBy("EMPLOYMENT_TYPE_NAME").agg(
    count("*").alias("job_count"),
    avg("SALARY").alias("avg_salary"),
    min("SALARY").alias("min_salary"),
    max("SALARY").alias("max_salary")
)

aggregated_df.show()
[Stage 3:>                                                          (0 + 1) / 1]                                                                                
+--------------------+---------+------------------+----------+----------+
|EMPLOYMENT_TYPE_NAME|job_count|        avg_salary|min_salary|max_salary|
+--------------------+---------+------------------+----------+----------+
|Part-time / full-...|      619| 105621.2423263328|     20800|    455375|
|Part-time (≤ 32...|     1038| 98802.50963391137|     15860|    310050|
|Full-time (> 32 h...|    29151|118897.55860862407|     20583|    500000|
+--------------------+---------+------------------+----------+----------+
filtered_df.select("EMPLOYMENT_TYPE_NAME", "SALARY") \
           .write.mode("overwrite") \
           .option("header", True) \
           .csv("filtered_salary_data")

import pandas as pd
import glob

files = glob.glob("filtered_salary_data/part-*.csv")
pdf = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
                                                                                
pdf["EMPLOYMENT_TYPE_NAME"] = pdf["EMPLOYMENT_TYPE_NAME"].replace({
    "Part-time (≤ 32 hours)": "Part-time (≤ 32 hours)",
    "Full-time (≥ 32 hours)": "Full-time (≥ 32 hours)",
    "Part-time / full-time": "Part-time / full-time"
})
# Your Code for 1st question here
#pdf = df.select("EMPLOYMENT_TYPE_NAME", "SALARY").toPandas()
import plotly.io as pio
pio.renderers.default = "notebook_connected"

fig = px.box(pdf, x="EMPLOYMENT_TYPE_NAME", y="SALARY", 
             title="Salary Distribution by Employment Type", 
             color = "EMPLOYMENT_TYPE_NAME",
             color_discrete_sequence= ["yellow", "red", "coral"],
             labels={
        "EMPLOYMENT_TYPE_NAME": "Employment Type",
        "SALARY": "Salary"
    })
fig.update_layout(font_family="Aptos", title_font_size=17, 
                  xaxis=dict(title=dict(text='Employment Type'), zeroline=False),
                  yaxis=dict(title=dict(text='Salary'), zeroline=False))


fig.show()
import os
fig.write_image("_output/figure.svg")

import kaleido
print("Kaleido version:", kaleido.__version__)
ValueError: 
Image export using the "kaleido" engine requires the kaleido package,
which can be installed using pip:
    $ pip install -U kaleido

First I cleaned up the employment type labels in the DataFrame for readability in the plot. From the box plot, we can observe that full-time employment has a significant number of outliers on the higher end of the salary distribution. In contrast, part-time (≤ 32 hours) shows a more tightly concentrated distribution with fewer extreme values. The interquartile range for part-time/full-time jobs appears narrower, indicating lower variability within that category. Also, the median salary for full-time roles is the highest among all groups, while part-time roles have the lowest median salaries.

2 Salary Distribution by Industry

  • Compare salary variations across industries.
  • Filter the dataset
    • Keep records where salary is greater than zero.
  • Aggregate Data
    • Group by NAICS industry codes.
  • Visualize results
    • Create a box plot where:
      • X-axis = NAICS2_NAME
      • Y-axis = SALARY_FROM
    • Customize colors, fonts, and styles.
  • Explanation: Write two sentences about what the graph reveals.
# Your code for 2nd question here
filtered_df = df.filter((col("SALARY_FROM").isNotNull()) & (col("SALARY_FROM") != 0))

industry_df = filtered_df.groupBy("NAICS2_NAME").agg(
    count("*").alias("job_count"))

industry_df.show() 
[Stage 7:>                                                          (0 + 1) / 1]                                                                                
+--------------------+---------+
|         NAICS2_NAME|job_count|
+--------------------+---------+
|Administrative an...|     4040|
|Public Administra...|      737|
|Real Estate and R...|      454|
|         Information|     2356|
|Unclassified Indu...|     3811|
|Accommodation and...|      270|
|Finance and Insur...|     4013|
|        Construction|      299|
|           Utilities|      343|
|Management of Com...|       41|
|Professional, Sci...|     9282|
|Arts, Entertainme...|       90|
|Other Services (e...|      385|
|Transportation an...|      245|
|     Wholesale Trade|      943|
|Agriculture, Fore...|       29|
|       Manufacturing|     1740|
|Mining, Quarrying...|       38|
|Educational Services|     1033|
|Health Care and S...|     1442|
+--------------------+---------+
only showing top 20 rows
filtered_df = df.filter((col("SALARY_FROM").isNotNull()) & (col("SALARY_FROM") > 0)& 
                        (col("NAICS2_NAME").isin("Health Care and Social Assistance",
                                                  "Finance and Insurance", "Information")))
salary_df = filtered_df.select("NAICS2_NAME", "SALARY_FROM")
salary_df.write.mode("overwrite").option("header", True).csv("industry_salaries")
files = glob.glob("industry_salaries/part-*.csv")
pdf = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
pio.renderers.default = "notebook_connected"

fig = px.box(
    pdf,
    x="NAICS2_NAME",
    y="SALARY_FROM",
    title="Salary Distribution by Industry",
    color="NAICS2_NAME",
    color_discrete_sequence = ['rgb(67,67,67)', 'rgb(115,115,115)', 'rgb(49,130,189)']
    
)

fig.update_layout(
    font_family="Aptos",
    title_font_size=17,
    xaxis=dict(title=dict(text='Industry Type'), zeroline=False),
    yaxis=dict(title=dict(text='Salary'), zeroline=False)
)

fig.show()
                                                                                

From this box plot, we can observe the salary distributions for the Information, Finance & Insurance, and Healthcare & Social Assistance industries. Among these, the Information industry has the highest median salary, while the Healthcare & Social Assistance industry shows the lowest median salary. But, also we see a wide distribution of data in the healthcare industry. This suggests that job roles in the Information sector tend to be more highly compensated compared to the other two industries.

4 Top 10 Job Titles by Count

  • Identify the most frequently posted job titles.
  • Aggregate Data
    • Count the occurrences of each job title (TITLE_NAME).
    • Select the top 10 most frequent titles.
  • Visualize results
    • Create a bar chart where:
      • X-axis = TITLE_NAME
      • Y-axis = Job Count
    • Apply custom colors and font styles.
  • Explanation: Write two sentences about what the graph reveals.
job_title = df.filter(col("TITLE_NAME").isNotNull()) \
                .groupBy("TITLE_NAME") \
                .agg(count("*").alias("job_count"))
top_10_jobs = job_title.orderBy("job_count", ascending=False).limit(10)

top_10_jobs.show()
[Stage 14:>                                                         (0 + 1) / 1]                                                                                
+--------------------+---------+
|          TITLE_NAME|job_count|
+--------------------+---------+
|       Data Analysts|     8591|
|        Unclassified|     3149|
|Business Intellig...|     2072|
|Enterprise Archit...|     1999|
|Oracle Cloud HCM ...|     1042|
|       Data Modelers|      668|
|Data Governance A...|      628|
|Data Analytics En...|      537|
|ERP Business Anal...|      488|
|Data Quality Anal...|      467|
+--------------------+---------+
# Your code for 4th question here
top_10_jobs.write.mode("overwrite").option("header", True).csv("top_10_jobs")
files = glob.glob("top_10_jobs/part-*.csv")
pdf4 = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

pio.renderers.default = "notebook_connected"

fig = px.bar(
    pdf4,
    x="TITLE_NAME",
    y="job_count",
    title="Top 10 Jobs",
    color = "TITLE_NAME"
)

fig.update_layout(
    font_family="Aptos",
    title_font_size=17,
    xaxis_title="Top 10 Jobs",
    yaxis_title="Number of Job Postings"
)

fig.show()
                                                                                

From this bar graph, we can see that Data Analyst roles are by far the most sought after in the current job market. Among the top 10 job titles, many are data-related, but Data Analyst stands out as the most in demand position.

5 Remote vs On-Site Job Postings

  • Compare the proportion of remote and on-site job postings.
  • Aggregate Data
    • Count job postings by remote type (REMOTE_TYPE_NAME).
  • Visualize results
    • Create a pie chart where:
      • Labels = REMOTE_TYPE_NAME
      • Values = Job Count
    • Apply custom colors and font styles.
  • Explanation: Write two sentences about what the graph reveals.
# Your code for 5th question here
remote_jobs = df.filter(col("REMOTE_TYPE_NAME").isNotNull()) \
                .groupBy("REMOTE_TYPE_NAME") \
                .agg(count("*").alias("job_count"))

remote_jobs.show()
[Stage 20:>                                                         (0 + 1) / 1]                                                                                
+----------------+---------+
|REMOTE_TYPE_NAME|job_count|
+----------------+---------+
|          Remote|    12497|
|          [None]|    56570|
|      Not Remote|     1127|
|   Hybrid Remote|     2260|
+----------------+---------+
remote_jobs.write.mode("overwrite").option("header", True).csv("remote_jobs")
files = glob.glob("remote_jobs/part-*.csv")
pdf5 = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

pio.renderers.default = "notebook_connected"

custom_color = [
    "#4B0082", "#6A0DAD", "#8A2BE2", "#A678F1", "#C8A2C8", "#E6E6FA"
] 

fig = px.pie(
    pdf5,
    names="REMOTE_TYPE_NAME",  
    values="job_count",            
    title="Remote vs On-Site Jobs",
    color_discrete_sequence=custom_color
)

fig.update_layout(
    font_family="Aptos",
    title_font_size=17
)
fig.show()
                                                                                

This is a pie chart showing the percentages of remote and on site jobs. From the graph we can see that most of the jobs listed in the data, 78.1% of the jobs are listed as none, which means the job posting did not specify it. But, we can see that 17.2% jobs have been listed as remote and 3.12% have been listed as hybrid.

6 Skill Demand Analysis by Industry (Stacked Bar Chart)

  • Identify which skills are most in demand in various industries.
  • Aggregate Data
    • Extract skills from job postings.
    • Count occurrences of skills grouped by NAICS industry codes.
  • Visualize results
    • Create a stacked bar chart where:
      • X-axis = Industry
      • Y-axis = Skill Count
      • Color = Skill
    • Apply custom colors and font styles.
  • Explanation: Write two sentences about what the graph reveals.
# Your code for 6th question here
from pyspark.sql.functions import split, explode, trim

skills_df = df.filter(
    col("SKILLS_NAME").isNotNull() &
    col("NAICS2_NAME").isNotNull() &
    col("NAICS2_NAME").isin(
        "Health Care and Social Assistance",
        "Finance and Insurance",
        "Information"
    )
)

skills_df = skills_df.withColumn("SKILL", explode(split(col("SKILLS_NAME"), ",")))

skills_df = skills_df.withColumn("SKILL", trim(col("SKILL")))

industry_skills_df = skills_df.groupBy("NAICS2_NAME", "SKILL") \
                              .agg(count("*").alias("skill_count"))
import glob
industry_skills_df.write.mode("overwrite").option("header", True).csv("industry_skills")
files = glob.glob("industry_skills/part-*.csv")
pdf6 = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

import plotly.express as px
import glob
fig = px.bar(
    pdf6,
    x="NAICS2_NAME",
    y="skill_count",
    color="SKILL",  
    title="Top Skills by Industry",
    barmode="stack"
)

fig.update_layout(
    font_family="Aptos",
    title_font_size=18,
    xaxis_title="Industry",
    yaxis_title="Skill Count"
)

fig.show()
                                                                                

From this bar graph we can see that one of the most sought after sill is data warehousing in finance and insurance industry, and the second one is the leadership skill. And in the next two industries the most sought after skill is computer science, i am not sure what level of programming or IT knowledge the companies are seeking, but I think haveing basic programming skill would be neccesary.

7 Salary Analysis by ONET Occupation Type (Bubble Chart)

  • Analyze how salaries differ across ONET occupation types.
  • Aggregate Data
    • Compute median salary for each occupation in the ONET taxonomy.
  • Visualize results
    • Create a bubble chart where:
      • X-axis = ONET_NAME
      • Y-axis = Median Salary
      • Size = Number of job postings
    • Apply custom colors and font styles.
  • Explanation: Write two sentences about what the graph reveals.
# Your code for 7th question here
onet_df = df.filter((col("SALARY_FROM").isNotNull()) & (col("SALARY_FROM") != 0) & (col("ONET_NAME").isNotNull()))

onet_groups = onet_df.select("ONET_NAME").distinct().rdd.flatMap(lambda x: x).collect()
print(onet_groups)

results = []
for onet in onet_groups:
    subset = onet_df.filter(col("ONET_NAME") == onet)
    median_salary = subset.approxQuantile("SALARY_FROM", [0.5], 0.01)[0]
    job_count = subset.count()
    results.append((onet, median_salary, job_count))
                                                                                                                                                                
['Business Intelligence Analysts']
import pandas as pd

onet_summary_df = pd.DataFrame(results, columns=["ONET_NAME", "median_salary", "job_count"])

pio.renderers.default = "notebook_connected"

fig = px.scatter(
    onet_summary_df,
    x="ONET_NAME",
    y="median_salary",
    size="job_count",
    title="Median Salary and Job Count by ONET Group",
    color="median_salary", 
    size_max=60,
    color_continuous_scale="Viridis"
)

fig.update_layout(
    font_family="Aptos",
    title_font_size=18,
    xaxis_title="ONET Job Family",
    yaxis_title="Median Salary",
    xaxis_tickangle=45
)

fig.show()

This is a bubble chart showing median salary by ONET occupation type. However, because our data has only one type of occupation type, which is Business Inteligence Analyst we get only one bubble for our bubble chart.